{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 有效使用内存子系统"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在您可以编写正确的 CUDA 核函数了，并且认识到，重要的是启动合适的线程网格从而为 GPU 提供足够的机会隐藏延迟。接下来您将学习有效利用 GPU 内存子系统的技术，这些技术广泛适用于各种 CUDA 应用程序，其中一些最重要的技术可以让您的 CUDA 代码运行得更快。\n",
    "\n",
    "您将从学习内存合并开始。为了挑战您理解内存合并的能力，并展示与许多 CUDA 应用程序相关的重要细节，您将学习二维网格和线程块。接下来，您将学习一种非常快速的、用户控制的、按需分配的内存空间，称为共享内存。您将使用共享内存来为实现内存合并提供可行的方法，否则内存合并是不可能的。最后，您将了解共享内存区的冲突（它会破坏使用共享内存提高性能的可能性），以及学习解决冲突的技术。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 目标"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "完成本部分内容后，您将能够：\n",
    "* 编写受益于合并内存访问模式的 CUDA 核函数。\n",
    "* 使用多维网格和线程块。\n",
    "* 使用共享内存来协调块内的线程。\n",
    "* 使用共享内存来促进合并内存访问模式。\n",
    "* 解决共享内存区的冲突。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 问题：未合并的内存访问会影响性能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在您了解有关 **合并的内存访问** 的详细信息之前，请运行以下单元格并观察核函数中看似微不足道的对数据访问模式的更改是如何影响性能的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 生成数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个单元格中，我们定义了 `n` 并创建了一个线程等于 `n` 的网格。 我们还创建了一个长度为 `n` 的输出向量。 \n",
    "对于输入，我们创建大小为 `stride * n` 的向量，原因如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 1024*1024 # 1M\n",
    "\n",
    "threads_per_block = 1024\n",
    "blocks = int(n / threads_per_block)\n",
    "\n",
    "stride = 16\n",
    "\n",
    "# Input Vectors of length stride * n\n",
    "a = np.ones(stride * n).astype(np.float32)\n",
    "b = a.copy().astype(np.float32)\n",
    "\n",
    "# Output Vector\n",
    "out = np.zeros(n).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_b = cuda.to_device(b)\n",
    "d_out = cuda.to_device(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 核函数的定义"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在`add_experiment`中，网格中的每个线程都会将`a`中的一个数据项和`b`中的一个数据项加起来，并将结果写入`out`。 我们把核函数编写成可以传递`coalesced`（`合并`）的真值`True`或`False`（`真`或`假`），它决定了如何取`a`向量和`b`向量的索引值。您将在下面看到两种模式的性能比较。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def add_experiment(a, b, out, stride, coalesced):\n",
    "    i = cuda.grid(1)\n",
    "    # The above line is equivalent to\n",
    "    # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    if coalesced == True:\n",
    "        out[i] = a[i] + b[i]\n",
    "    else:\n",
    "        out[i] = a[stride*i] + b[stride*i]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 以合并内存访问启动核函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这里，我们把 `True` 传递给 `coalesced` 参数，并在多次运行中观察核函数的性能："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, True); cuda.synchronize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，我们检查核函数是否按预期正确地运行："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_out.copy_to_host()\n",
    "truth = a[:n] + b[:n]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, truth)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 以非合并内存访问启动核函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个单元格中，我们传递 `False`给参数`coalesced`，以观察 `add_experiment` 的未合并数据访问模式的性能："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, False); cuda.synchronize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这里，我们确保核函数按预期运行："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_out.copy_to_host()\n",
    "truth = a[::stride] + b[::stride]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, truth)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 4\n",
    "stride = 16\n",
    "\n",
    "# Input Vectors of length stride * n\n",
    "a = np.arange(stride * n)\n",
    "b = a.copy().astype(np.float32)\n",
    "\n",
    "a[::stride]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 结论"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "非合并数据访问模式的性能要差得多。现在，您将了解这是为什么，以及如何在核函数中考虑数据访问模式以获得高性能。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 讲稿: 全局内存的合并访问"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行以下单元格以加载幻灯片，然后单击\"Start Slide Show\"（“开始幻灯片放映”）可使其全屏显示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1-ZH/coalescing-v3.pptx', 800, 450)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> _**脚注**：有关跨各种设备的全局内存段大小以及缓存的其他详细信息，请参阅 [CUDA 最佳实践指南](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory)。_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 练习：对列和对行求和"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在本练习中，您将编写一个使用完全合并内存访问模式的对矩阵的列元素求和的核函数。首先，您先观察以非合并内存访问的对行元素求和的核函数的性能。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对行元素求和"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**导入库**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**生成数据**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个单元格中，我们创建一个输入矩阵，以及一个用于存储结果的向量，并将它们传输到设备内存里。我们还定义了在下面启动核函数时要使用的网格和块的尺寸。我们将任意一行数据设置为某个任意值，以方便下面检查结果的正确性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 16384 # matrix side size\n",
    "threads_per_block = 256\n",
    "blocks = int(n / threads_per_block)\n",
    "\n",
    "# Input Matrix\n",
    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
    "# Here we set an arbitrary row to an arbitrary value to facilitate a check for correctness below.\n",
    "a[3] = 9\n",
    "\n",
    "# Output vector\n",
    "sums = np.zeros(n).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_sums = cuda.to_device(sums)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**核函数的定义**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`row_sums` 将使用每个线程迭代一行数据，将其求和，然后将结果存储在 `sums` 中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def row_sums(a, sums, n):\n",
    "    idx = cuda.grid(1)\n",
    "    sum = 0.0\n",
    "    \n",
    "    for i in range(n):\n",
    "        # Each thread will sum a row of `a`\n",
    "        sum += a[idx][i]\n",
    "        \n",
    "    sums[idx] = sum"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**对行求和的性能**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit row_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查正确性**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_sums.copy_to_host()\n",
    "truth = a.sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(truth, result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对列求和"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**导入库**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**生成数据**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这个单元格中，我们创建一个输入矩阵，以及一个用于存储结果的向量，并将它们传输到设备内存里。我们还定义了在下面启动核函数时要使用的网格和块的尺寸。我们将任意一列数据设置为某个任意值，以方便下面检查结果的正确性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 16384 # matrix side size\n",
    "threads_per_block = 256\n",
    "blocks = int(n / threads_per_block)\n",
    "\n",
    "a = np.ones(n*n).reshape(n, n).astype(np.float32)\n",
    "# Here we set an arbitrary column to an arbitrary value to facilitate a check for correctness below.\n",
    "a[:, 3] = 9\n",
    "sums = np.zeros(n).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_sums = cuda.to_device(sums)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**核函数的定义**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`col_sums` 将使用每个线程迭代一列数据，对其求和，然后将结果存储在 `sums` 中。请您完成核函数的定义，实现这个操作。如果您遇到困难，请随时参考 [答案](../edit/solutions/col_sums_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def col_sums(a, sums, ds):\n",
    "    # TODO: Write this kernel to store the sum of each column in matrix `a` to the `sums` vector.\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查性能**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设您已经编写了 `col_sums` 来使用合并访问模式，与上面运行的使用非合并访问模式的 `row_sums` 相比，您应该会看到显著的（几乎是 2 倍）加速："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%timeit col_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查准确性**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "确保您的核函数运行正确。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_sums.copy_to_host()\n",
    "truth = a.sum(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(truth, result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 维和 3 维线程块和网格"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "网格和块都可以配置为分别包含块或线程的 2 维或 3 维集合。 这样做主要是为了方便经常使用 2 维或 3 维数据集的程序员。 这里有一个非常简单的例子来描述相应的语法。为了理解这些概念，您可能需要阅读核函数的定义**以及**如何启动该核函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "A = np.zeros((4,4)) # A 4x4 Matrix of 0's\n",
    "d_A = cuda.to_device(A)\n",
    "\n",
    "# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure\n",
    "# by using a Python tuple to signify grid and block dimensions.\n",
    "blocks = (2, 2)\n",
    "threads_per_block = (2, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该核函数以值全为 0 的矩阵为输入，并在矩阵的每个元素位置写入新的值。元素的位置就是用网格内以`X.Y`的格式表示的线程的 (x,y) 坐标："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def get_2D_indices(A):\n",
    "    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid\n",
    "    x, y = cuda.grid(2)\n",
    "    # The above is equivalent to the following 2 lines of code:\n",
    "    # x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    # y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
    "    \n",
    "    # Write the x index followed by a decimal and the y index.\n",
    "    A[x][y] = x + y / 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "get_2D_indices[blocks, threads_per_block](d_A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_A.copy_to_host()\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 练习：以合并内存访问实现二维矩阵相加"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 生成数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在下面的单元格中，我们定义了有 2048x2048 个元素的输入矩阵 `a` 和 `b`，以及一个 2048x2048 的初始化为 0 的输出矩阵。我们将这些矩阵复制到设备内存里。\n",
    "\n",
    "我们还定义了下面要使用的二维块和网格。请注意，我们创建的网格内的总线程数与输入和输出矩阵的元素数相同，因此网格中的每个线程都将计算2个数之和并放进输出矩阵中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 2048*2048 # 4M\n",
    "\n",
    "# 2D blocks\n",
    "threads_per_block = (32, 32)\n",
    "# 2D grid\n",
    "blocks = (64, 64)\n",
    "\n",
    "# 2048x2048 input matrices\n",
    "a = np.arange(n).reshape(2048,2048).astype(np.float32)\n",
    "b = a.copy().astype(np.float32)\n",
    "\n",
    "# 2048x2048 0-initialized output matrix\n",
    "out = np.zeros_like(a).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_b = cuda.to_device(b)\n",
    "d_out = cuda.to_device(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2-维矩阵的加法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你的工作是完成 `matrix_add` 中的 TODO 部分，将 `a` 和 `b` 正确地求和，结果给 `out`。 作为对合并访问模式理解的挑战，`matrix_add` 将接受一个 `coalesced` 布尔值，指示是否应该使用合并访问模式。两种模式，即 coalesced（合并） 和 uncoalesced（非合并），都应该产生正确的结果，但是，在将 `coalesced` 设置为 `True` 的情况下运行时，你应该观察到有显著的加速。\n",
    "\n",
    "如果您遇到困难，请随时查看[答案](../edit/solutions/matrix_add_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def matrix_add(a, b, out, coalesced):\n",
    "    # TODO: set x and y to index correctly such that each thread\n",
    "    # accesses one element in the data.\n",
    "    x, y = pass\n",
    "    \n",
    "    if coalesced == True:\n",
    "        # TODO: write the sum of one element in `a` and `b` to `out`\n",
    "        # using a coalesced memory access pattern.\n",
    "    else:\n",
    "        # TODO: write the sum of one element in `a` and `b` to `out`\n",
    "        # using an uncoalesced memory access pattern."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 检查性能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "运行下面的两个单元格以使用您写入的以合并和非合并访问模式启动`matrix_add`，并观察性能差异。后面还有额外的单元来确认您的核函数的正确性。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**合并访问**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, True); cuda.synchronize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_out.copy_to_host()\n",
    "truth = a+b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, truth)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**非合并访问**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, False); cuda.synchronize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_out.copy_to_host()\n",
    "truth = a+b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, truth)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 共享内存"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "到目前为止，我们一直在区分主机内存和设备内存，就好像设备内存只是一种内存一样。但实际上，CUDA 有更细粒度的[内存层次结构](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy)。到目前为止，我们一直在使用的设备内存称为**全局内存**，可用于设备上的任何线程或块，可以在应用程序的整个生命周期内持续存在，并且是一个相对较大的内存空间。\n",
    "\n",
    "我们现在将讨论如何利用称为**共享内存**的片上设备内存区域。共享内存是程序员定义的容量有限的缓存，容量大小[取决于GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) ，并且只在同一个块中的所有线程之间所**共享**。它是一种稀缺资源，不能由分配该共享内存的块之外的线程所访问，并且在核函数完成执行后不会持续存在。然而，共享内存的带宽比全局内存高得多，可以在许多核函数中发挥巨大作用，尤其是在优化性能方面。\n",
    "\n",
    "以下是共享内存的一些常见用例：\n",
    "\n",
    " * 缓存需要在一个块内多次读取的全局内存里的内容。\n",
    " * 缓存线程的输出，以便在将其回写至全局内存之前先进行合并。\n",
    " * 为块内的分散/收集操作暂存数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 共享内存的语法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numba 提供了[相应的函数](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization)用于分配共享内存以及块内线程之间的同步。在并行线程读取或写入共享内存后，同步通常是必需的。\n",
    "\n",
    "在声明共享内存时，您需要提供共享数组的形状，并指定数据元素的 [Numba 类型](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types)。 **数组的形状必须是一个常量值**，因此，您不能使用传递给函数的参数，或者是像`numba.cuda.blockDim.x`这样的变量，或者是`cuda.griddim`的计算值。下面是一个复杂的例子来演示如何使用这个语法，其中的注释指出从主机内存到全局设备内存，再到共享内存，然后回到全局设备内存，最后回到主机内存的数据迁移过程："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**导入库**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们将使用 `numba.types` 来定义共享内存中值的类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import types, cuda"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**使用共享内存交换元素**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面的核函数有一个输入向量`vector`，每个线程首先将该向量的一个元素写入共享内存，然后执行同步，等所有元素都已写入共享内存后，再将共享内存中的一个元素写入输出向量`swapped`。\n",
    "\n",
    "值得注意的是，一个线程把共享内存中的一个值写入`swapped`向量，该值是被另一个线程写入共享内存的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def swap_with_shared(vector, swapped):\n",
    "    # Allocate a 4 element vector containing int32 values in shared memory.\n",
    "    temp = cuda.shared.array(4, dtype=types.int32)\n",
    "    \n",
    "    idx = cuda.grid(1)\n",
    "    \n",
    "    # Move an element from global memory into shared memory\n",
    "    temp[idx] = vector[idx]\n",
    "    \n",
    "    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...\n",
    "    cuda.syncthreads()\n",
    "    #...the following operation is reading an element written to shared memory by another thread.\n",
    "    \n",
    "    # Move an element from shared memory back into global memory\n",
    "    swapped[idx] = temp[3 - cuda.threadIdx.x] # swap elements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**生成数据**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "vector = np.arange(4).astype(np.int32)\n",
    "swapped = np.zeros_like(vector)\n",
    "\n",
    "# Move host memory to device (global) memory\n",
    "d_vector = cuda.to_device(vector)\n",
    "d_swapped = cuda.to_device(swapped)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**执行核函数**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "swap_with_shared[1, 4](d_vector, d_swapped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查结果**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Move device (global) memory back to the host\n",
    "result = d_swapped.copy_to_host()\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 讲稿：用于内存合并的共享内存"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行以下单元格以加载幻灯片，然后单击\"Start Slide Show\"（“开始幻灯片放映”）可使其全屏显示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1-ZH/shared_coalescing.pptx', 800, 450)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 练习：利用共享内存实现矩阵转置的合并读和合并写"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在本练习中，您将通过编写矩阵转置核函数来实现讲稿中刚刚演示的内容，该核函数使用共享内存对全局内存中的输入输出矩阵进行合并读取和合并写入。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 合并读，非合并写"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "作为参考和性能比较，这里有一个简单的矩阵转置核函数，它从输入矩阵读取数据时是合并读，但写入输出矩阵时是非合并写。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**导入库**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from numba import cuda\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**生成数据**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里我们创建了一个 4096x4096 的输入矩阵 `a` 和一个 4096x4096 的输出矩阵 `transposed`，并将它们复制到设备内存中。\n",
    "\n",
    "我们还定义了一个带有二维块的二维网格，将在下面使用。 请注意，我们创建了一个网格，其线程总数等于输入矩阵中的元素数量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 4096*4096 # 16M\n",
    "\n",
    "# 2D blocks\n",
    "threads_per_block = (32, 32)\n",
    "#2D grid\n",
    "blocks = (128, 128)\n",
    "\n",
    "# 4096x4096 input and output matrices\n",
    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
    "transposed = np.zeros_like(a).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_transposed = cuda.to_device(transposed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**简单的矩阵转置实现**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该核函数正确转置了 `a`，将转置后的元素写入了 `transposed`。 它以合并的方式从 `a` 读取数据，然而，它对 `transposed` 的写入是非合并的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def transpose(a, transposed):\n",
    "    x, y = cuda.grid(2)\n",
    "\n",
    "    transposed[x][y] = a[y][x]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查性能**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查正确性**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_transposed.copy_to_host()\n",
    "expected = a.T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, expected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 为合并读和合并写而改写程序"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "您的工作将是重写 `transpose` 核函数以使用共享内存，并以合并的方式从全局内存读取和写入。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**导入库**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda, types as numba_types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**生成数据**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 4096*4096 # 16M\n",
    "\n",
    "# 2D blocks\n",
    "threads_per_block = (32, 32)\n",
    "#2D grid\n",
    "blocks = (128, 128)\n",
    "\n",
    "# 4096x4096 input and output matrices\n",
    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
    "transposed = np.zeros_like(a).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_transposed = cuda.to_device(transposed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**编写一个使用共享内存的矩阵转置核函数**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "完成 `tile_transpose` 核函数定义中的 TODO 部分。\n",
    "\n",
    "如果您遇到困难，请随时查看[答案](../edit/solutions/tile_transpose_solution.py)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def tile_transpose(a, transposed):\n",
    "    # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n",
    "    # and that `a` is a multiple of these dimensions.\n",
    "    \n",
    "    # 1) Create 32x32 shared memory array.\n",
    "    \n",
    "    # TODO: Your code here.\n",
    "\n",
    "    # Compute offsets into global input array. Recall for coalesced access we want to map threadIdx.x increments to\n",
    "    # the fastest changing index in the data, i.e. the column in our array.\n",
    "    # Note: `a_col` and `a_row` are already correct.\n",
    "    a_col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    a_row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
    "    \n",
    "    # 2) Make coalesced read from global memory (using grid indices)\n",
    "    # into shared memory array (using thread indices).\n",
    "    \n",
    "    # TODO: Your code here.\n",
    "\n",
    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
    "    \n",
    "    # TODO: Your code here.\n",
    "    \n",
    "    # 4) Calculate transposed location for the shared memory array tile\n",
    "    # to be written back to global memory. Note that blockIdx.y*blockDim.y \n",
    "    # and blockIdx.x* blockDim.x are swapped (because we want to write to the\n",
    "    # transpose locations), but we want to keep access coalesced, so match up the\n",
    "    # threadIdx.x to the fastest changing index, i.e. the column./\n",
    "    # Note: `t_col` and `t_row` are already correct.\n",
    "    t_col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
    "    t_row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
    "\n",
    "    # 5) Write from shared memory (using thread indices)\n",
    "    # back to global memory (using grid indices)\n",
    "    # transposing each element within the shared memory array.\n",
    "    \n",
    "    # TODO: Your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查性能**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "检查重构后的矩阵转置核函数的性能。与上面的基准转置性能相比，您应该会看到加速。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%timeit tile_transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**检查准确性**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_transposed.copy_to_host()\n",
    "expected = a.T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, expected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 为什么这么小的改进?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "虽然对于仅仅几行代码来说这是一个显著的加速，但是您可能会认为，对比先前的合并访问模式的性能改进，现在的性能改进并不像您预期的那么明显。这主要有两个原因：\n",
    "\n",
    "1. 简单的矩阵转置核函数已经进行了合并读取，因此，您的重构版本仅优化了整个核函数执行过程中的全局内存访问的一半。\n",
    "2. 您编写的代码存在称为共享内存区冲突的问题，我们现在将关注这个问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 讲稿：内存区冲突"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行以下单元格以加载幻灯片，然后单击\"Start Slide Show\"（“开始幻灯片放映”）可使其全屏显示。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1-ZH/bank_conflicts.pptx', 800, 450)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 评估：解决内存区冲突"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "作为最后的练习，为了在课程的最后一部分中获得证书，您将使用共享内存重构矩阵转置核函数，使其没有共享内存区的冲突。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numba import cuda, types as numba_types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 生成数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n = 4096*4096 # 16M\n",
    "threads_per_block = (32, 32)\n",
    "blocks = (128, 128)\n",
    "\n",
    "a = np.arange(n).reshape((4096,4096)).astype(np.float32)\n",
    "transposed = np.zeros_like(a).astype(np.float32)\n",
    "\n",
    "d_a = cuda.to_device(a)\n",
    "d_transposed = cuda.to_device(transposed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实现无共享内存区冲突的核函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`tile_transpose_conflict_free`是一个工作正常的矩阵转置核函数，它利用了共享内存，以便实现对全局内存的合并读取和合并写入。您的工作是重构该核函数，使其不受内存区冲突的影响。\n",
    "\n",
    "**注意：** 因为这个最后的练习计入课程的认证分数，所以我们将不提供答案。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@cuda.jit\n",
    "def tile_transpose_conflict_free(a, transposed):\n",
    "    # `tile_transpose` assumes it is launched with a 32x32 block dimension,\n",
    "    # and that `a` is a multiple of these dimensions.\n",
    "    \n",
    "    # 1) Create 32x32 shared memory array.\n",
    "    tile = cuda.shared.array((32, 32), numba_types.int32)\n",
    "\n",
    "    # Compute offsets into global input array.\n",
    "    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n",
    "    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n",
    "    \n",
    "    # 2) Make coalesced read from global memory into shared memory array.\n",
    "    # Note the use of local thread indices for the shared memory write,\n",
    "    # and global offsets for global memory read.\n",
    "    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]\n",
    "\n",
    "    # 3) Wait for all threads in the block to finish updating shared memory.\n",
    "    cuda.syncthreads()\n",
    "    \n",
    "    # 4) Calculate transposed location for the shared memory array tile\n",
    "    # to be written back to global memory.\n",
    "    t_x = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x\n",
    "    t_y = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y\n",
    "\n",
    "    # 5) Write back to global memory,\n",
    "    # transposing each element within the shared memory array.\n",
    "    transposed[t_y, t_x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 检查性能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设您已正确解决了内存区冲突的问题，那么该核函数的运行速度应该比简单的矩阵转置核函数和共享内存（有内存区冲突）的矩阵转置核函数要快得多。为了通过评估，您的核函数的平均运行时间需要少于 840 微秒。\n",
    "\n",
    "通过运行以下单元格打印的第一个值将为您提供该核函数的平均运行时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%timeit tile_transpose_conflict_free[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 检查准确性"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了通过评估，您的核函数还需要正常工作。运行以下 2 个单元格以验证确实如此。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = d_transposed.copy_to_host()\n",
    "expected = a.T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.array_equal(result, expected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 运行评估"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果您已完成重构工作，并观察到它的运行时间少于 840 微秒，而且确认它运行正确，那么请执行以下单元格，以根据您的核函数的定义运行评估程序。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from assessment import assess"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "assess(tile_transpose_conflict_free)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 为您的工作获得分数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "成功通过上述评估后，请重新访问您启动此交互环境的网页，然后单击 **“ASSESS TASK”** 按钮，如下面的屏幕截图所示。这样做将使您在课程的这一部分获得学分，并计入获得整个课程的**能力证书**的分数里。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![运行评估](images/run_the_assessment.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "您已经完成了本课程，您能够：\n",
    "\n",
    "* 编写受益于合并内存访问模式的 CUDA 核函数。\n",
    "* 使用多维网格和线程块。\n",
    "* 使用共享内存来协调块内的线程。\n",
    "* 使用共享内存来促进合并内存访问模式。\n",
    "* 解决共享内存区的冲突。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 下载课程内容"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要下载此 notebook 的内容，请执行以下单元格，然后单击下面的下载链接。注意：如果你在本地 Jupyter 服务器上运行这个 notebook，你会发现 notebook 中的一些文件路径链接无法使用，因为它们只适合我们的课程平台。不过，您仍然可以通过 Jupyter 文件导航器导航到那些文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!tar -zcvf section3.tar.gz ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[下载这部分课程的内容](files/section3.tar.gz)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
